+++
title = 'Matrix models'
template = 'page-math.html'
+++
# Matrix models

## Recommender systems
Using the example of movie recommendations for users.

Recommendation using only explicit info:
- we have no representation for users or movies, only 'atomic' objects that we want to compare for similarity
- we saw this with word embeddings: represent each word by its own vector and optimise the values of the vectors

Embedding models:
- train a length-k embedding for each user and one for each movie, and arrange them into matrices U and M.
- the matrices are the parameters of the model

![9277af676d9256fb745e41d8cf59dcd1.png](caf202f13ca04b06a4a3d1e9e8a3c702.png)

To make a prediction, define that the dot product of a user vector and a movie vector should be high if the user would like the movie.
This is a choice, but it's a logical one.

Multiplying U<sup>T</sup> with M gives a matrix of rating predictions for every user/movie pair.
So we want to take the rating matrix R and decompose it as a product of two factors ("matrix factorization/decomposition").

## Matrix factorization
You get an optimisation problem: choose U and M such that U<sup>T</sup>M ≈ R.

$\begin{aligned}
\argmin_{U,M} \| R - U^T M \|_{F}^{2}
&= \argmin_{U,M} \sum_{i,j} (R_{ij} - \lbrack U^T M \rbrack_{ij})^2 \\
&= \argmin_{U,M} \sum_{i,j} (R_{ij} - u_{i}^{T} m_j)^2
\end{aligned}$

But R is not complete; for most user/movie pairs we don't know the rating.
So it's sometimes better to optimise only over the known ratings:

$\argmin_{U,M} \sum_{i,j \in R_{\text{known}}} (R_{ij} - u_{i}^{T} m_j)^2$

Alternating least squares - an alternative to gradient descent:
- idea: if we know M, computing U is easy, and vice versa
- so, starting with random U and M:
  - fix M, compute a new U
  - fix U, compute a new M

Stochastic gradient descent is usually more practical.

If we only have positive ratings, we have two options:
- ensure that U<sup>T</sup> M always represents probabilities, and maximise the probability of the data.
- sample random user/movie pairs as negative training examples (i.e., assume the user doesn't know those movies)

### Bias control
- control for user bias
  - ratings differ between users, they're subjective (no shit)
  - if we explicitly model the bias of a user, the matrix factorisation only needs to predict how much the user would deviate from their average rating for a particular movie
- control for movie bias
  - some movies are universally hated, some are universally loved
  - unpopular opinion: The Rise of Skywalker wasn't really that bad
- control for temporal bias
  - data is not stable over time
  - e.g. the meaning of specific ratings can change

For user/movie biases, use an additive scalar, which is learned along with the embeddings.
One for each user, one for each movie, and one general bias over all ratings.

For temporal bias, make the biases and embeddings time-dependent,
e.g. cut time into a small number of chunks and learn a separate embedding for each chunk.

### The 'cold start' problem
When a user/movie is added to the database, we have no ratings yet, so matrix factorization can't build an embedding.
We have to rely on implicit feedback and side information.

Using implicit "likes" (e.g. movies watched but not rated, movies browsed, movies hovered over...):
- add a separate movie embedding x and compute a second user embedding
- this is the sum of the x-embeddings of all movies user i has liked
- $u_{i}^{imp} = \sum_{j \in N(i)} x_j$
- add this implicit-feedback embedding to the existing one before computing the dot product

Using side info (age, login, browser resolution...):
- for simplification, assume all features are binary
- assign each feature an embedding, and sum over all features that apply to the user, creating a third user embedding
- $u_{i}^{side} = \sum_{f \in A(i)} y_f$
- add this one as well before computing the dot product
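A rough numpy sketch of this setup: plain SGD over the known ratings only, with the per-user, per-movie and general bias scalars from above. The function name, shapes and hyperparameters are made up for illustration, not a fixed recipe.

```python
import numpy as np

def train_mf(ratings, n_users, n_movies, k=16, lr=0.01, reg=0.1, epochs=20):
    """SGD matrix factorization over the known ratings only.

    ratings: list of (user, movie, rating) triples (the known entries of R).
    Learns user/movie embeddings plus per-user, per-movie and general biases.
    """
    rng = np.random.default_rng(0)
    U = rng.normal(scale=0.1, size=(n_users, k))   # user embeddings
    M = rng.normal(scale=0.1, size=(n_movies, k))  # movie embeddings
    b_u = np.zeros(n_users)                        # per-user bias
    b_m = np.zeros(n_movies)                       # per-movie bias
    mu = np.mean([r for _, _, r in ratings])       # general bias over all ratings

    for _ in range(epochs):
        for idx in rng.permutation(len(ratings)):  # visit known ratings in random order
            i, j, r = ratings[idx]
            err = r - (mu + b_u[i] + b_m[j] + U[i] @ M[j])
            u_old = U[i].copy()
            # gradient steps on the squared error, with a little L2 regularisation
            U[i] += lr * (err * M[j] - reg * U[i])
            M[j] += lr * (err * u_old - reg * M[j])
            b_u[i] += lr * (err - reg * b_u[i])
            b_m[j] += lr * (err - reg * b_m[j])
    return U, M, b_u, b_m, mu

# a prediction for user i, movie j is then: mu + b_u[i] + b_m[j] + U[i] @ M[j]
```

The implicit-feedback embedding $u_{i}^{imp}$ and the side-information embedding $u_{i}^{side}$ would simply be added to `U[i]` before the dot product.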
## Graph models
Graph convolutional network:
- start with node embeddings N₀
- compute a new embedding for each node: the average of its neighbors' embeddings and its own - AN₀
- multiply by a weight matrix W and apply an activation - N₁ = σ(AN₀W)
- applying this multiple times gives a multi-layered structure

Link prediction: assume the graph is incomplete and try to predict which nodes should be linked.
We can treat this like a matrix factorization problem.
GCN: perform convolutions on the embeddings, multiply them out to generate predictions, compare to the training data, and backpropagate the loss.

Node classification: for each node, try to predict a label.
With node embeddings we can use a regular classifier, but how do we get good embeddings?
GCN for classification: make the embedding size of the last layer equal to the number of classes, apply a softmax activation to the embeddings, and interpret them as probabilities over the classes.

GCN issues:
- depth is a problem: high connectivity diffuses information
- training is usually full-batch; the graph can't easily be broken up into minibatches
- pooling is not selective: all neighbors are mixed equally before the weights are applied

## Validating embedding models
Inductive vs transductive learning: in transduction, the learner is allowed to see the features of all the data, but the labels of only the training data.
Embedding models only support transductive learning; if we don't know about an object until after training, we won't have an embedding vector for it.
Likewise with graph models, we need to know the _whole_ graph.

To evaluate matrix factorization, give the training algorithm all users and movies, but withhold some ratings.
The same goes for links in a graph.
For node classification, give the algorithm the whole graph and the table linking node ids to labels, but withhold some of the labels.
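A rough numpy sketch tying the last two sections together: a two-layer GCN for node classification whose last-layer embedding size equals the number of classes, evaluated transductively by withholding some node labels while the whole graph stays visible. The toy graph, shapes and names are made up for illustration (row-normalised A·N averaging, as described above, rather than any particular published normalisation).

```python
import numpy as np

def normalized_adjacency(adj):
    """A-hat = D^-1 (A + I): average of a node's own embedding and its neighbors'."""
    a = adj + np.eye(adj.shape[0])
    return a / a.sum(axis=1, keepdims=True)

def gcn_layer(a_hat, nodes, weights, activation):
    # one graph convolution: N_{l+1} = sigma(A N_l W)
    return activation(a_hat @ nodes @ weights)

def softmax(x):
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# tiny made-up example: 4 nodes, 3 input features, 2 classes
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 1],
                [0, 1, 0, 1],
                [0, 1, 1, 0]], dtype=float)
rng = np.random.default_rng(0)
n0 = rng.normal(size=(4, 3))           # initial node embeddings N0
w1 = rng.normal(size=(3, 8)) * 0.1     # layer weights (would be learned)
w2 = rng.normal(size=(8, 2)) * 0.1     # last layer: embedding size = number of classes

a_hat = normalized_adjacency(adj)
n1 = gcn_layer(a_hat, n0, w1, np.tanh)  # N1 = sigma(A N0 W1)
probs = softmax(a_hat @ n1 @ w2)        # class probabilities per node

# transductive split: the whole graph is visible, but only some labels are used
labels = np.array([0, 1, 1, 0])
train_mask = np.array([True, True, False, False])  # withheld labels form the test set
loss = -np.mean(np.log(probs[train_mask, labels[train_mask]]))
# training would backpropagate this loss into w1 and w2; evaluation compares
# probs[~train_mask].argmax(axis=1) against the withheld labels[~train_mask]
```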